Abstract - A high-speed, wide-range parallel counter that
achieves high operating frequencies through a novel pipeline
partitioning methodology (a counting path and a state look-ahead
path) is proposed. It can be implemented using only
three simple, repeated CMOS-logic module types: an initial
module that generates anticipated counting states for modules
of higher significance through the state look-ahead path,
simple D-type flip-flops, and 2-bit counters. The state look-ahead
path prepares the counting path's next counter state
prior to the clock edge such that the clock edge triggers all
modules simultaneously, thus concurrently updating the
count state with a uniform delay at all counting path
modules/stages with respect to the clock edge. The structure
is scalable to arbitrary N-bit counter widths (2-to-2^N range)
using only the three module types and no fan-in or fan-out
increase. The counter's delay comprises the initial
module access time (a simple 2-bit counting stage), one
three-input AND-gate delay, and a D-type flip-flop setup-hold
time. The proposed counter can be implemented
without the AND gate, and hence its speed can be increased. The
design can be simulated with the ModelSim simulator. The
parallel counter can give a maximum operating speed of
2 GHz for an 8-bit counter. Finally, the area of a sample 8-bit
counter is 78 125 µm² (510 transistors) and its power
consumption is 13.89 mW at 2 GHz.

Keywords - High performance Counter design, Parallel 
counter design, Pipeline counter design 



I. INTRODUCTION 

Counters are widely used as essential building blocks for a 
variety of circuit operations such as programmable frequency 
dividers, shifters, code generators, memory select 
management, and various arithmetic operations. Since many 
applications are comprised of these fundamental operations, 
much research focuses on efficient counter architecture design. 
Counter architecture design methodologies explore tradeoffs 
between operating frequency, power consumption, area 
requirements, and target application specialization. 

In this paper, the counter operating frequency is
increased using a novel parallel counting architecture in 
conjunction with a state look-ahead path and pipelining to 
eliminate the carry chain delay and reduce AND gate fan-in 
and fan-out. The state look-ahead path bridges the anticipated 
overflow states to the counting modules, which are exploited 
in the counting path. 

The counting modules are partitioned into smaller 2-bit
counting modules separated by pipelined DFF latches. The
state look-ahead path is partitioned using the same pipelined
alignment paradigm as the counting path and thereby provides
the correct anticipated overflow states for all counting stages.
Subsequently, all counting states and all DFFs are triggered 
concurrently on the clock edge, enabling the count state in 
modules of higher significance to be anticipated by the count 
state in modules of lower significance. 

This cooperation between the counting path and the state look-ahead
path enables every counting module (both low and high
significance) to be triggered concurrently on the clock edge
without any rippling effect. The AND gate can be replaced by
flip-flops, which removes the AND-gate delay.

The merits of the proposed parallel counter are as follows:

1) A single clock input triggers all counting modules 
simultaneously, resulting in an operating frequency 
independent of counter width (assuming ideal parasitic 
capacitance on the clock wire path, without loss of generality). 
The total critical path delay (regardless of counter width) is 
uniform at all counting stages and is equal to the combination 
of the access time of a 2-bit counting module, and the DFF 
setup-hold time. 

2) The parallel counter architecture leverages modularity, 
which enables high flexibility and reusability, and thus enables 
short design time for wide counter applications. The 
architecture is composed of three basic module types separated 
by DFFs in a pipelined organization. These three module types 
are placed in a highly repetitious structure in both the counting 
path and the state look-ahead path, which limits localized
connections to only three signals (thus limiting fan-in and fan-out).

3) The counter output is in radix-2 representation so the 
count value can be read on-the-fly with no additional logic
decoding. 

4) Unlike previous parallel counter designs that have count 
latencies of two or three cycles, depending on the counter 
width, the parallel counter has no count latency, which enables 
the count value to be read on-the-fly. 
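
The first merit can be summarized as a timing bound. The expression below is a sketch under stated assumptions: the symbols $t_{\mathrm{cnt}}$ (access time of a 2-bit counting module) and $t_{\mathrm{sh}}$ (DFF setup-hold time) are introduced here for illustration and do not appear in the original text.

```latex
% Assumed symbols: t_cnt = 2-bit counting module access time,
%                  t_sh  = DFF setup-hold time.
T_{\mathrm{clk}} \;\ge\; t_{\mathrm{cnt}} + t_{\mathrm{sh}},
\qquad
f_{\max} \;=\; \frac{1}{t_{\mathrm{cnt}} + t_{\mathrm{sh}}}
```

Neither term depends on the counter width N, which is why the operating frequency is described as independent of counter width (under the ideal clock-wire assumption stated in merit 1).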

II. RELATED WORKS 

Counter architecture design methodologies explore 
tradeoffs between operating frequency, power consumption, 
area requirements, and target application specialization. 

Early design methodologies [4] improved counter 
operating frequency by partitioning large counters into 
multiple smaller counting modules, such that modules of 
higher significance (containing the more significant bits) were
enabled when all bits in all modules of lower significance
(containing the less significant bits) saturated. Initializations and
propagation delays such as register load time, AND logic 
chain decoding, and the half incrementer component delays in 
half adders dictated operating frequency. Subsequent 
methodologies [15], [22] improved counter operating 
frequency using half adders in the parallel counting modules 
that enabled carry signals generated at counting modules of 
lower significance to serve as the count enable for counting 
modules of higher significance, essentially implementing a 
carry chain from modules of lower significance to modules of
higher significance. The carry chain cascaded synchronously 
through intermediate D-type flip-flops (DFFs). The maximum 
operating frequency was limited by the half adder module 
delay, DFF access time, and the detector logic delay. Since the 
module outputs did not directly represent count state, the 
detector logic further decoded the module outputs to the 
outputted count state value. 

Further enhancements [27] improved operating frequency 
using multiple parallel counting modules separated by DFFs in 
a pipelined structure. The counting modules were composed of 
an incrementer that was based on a carry-ripple adder with one
input hardcoded to "1" [22]. In this design, counting modules 
of higher significance contained more cascaded carry-ripple 
adders than counting modules of lower significance. Each 
counting module's count enable signal was the logical AND of 
the carry signals from all the previous counting modules (all 
counting modules of lower significance), thus prescaling 
clocked modules of higher significance using a low frequency 
signal derived from modules of lower significance. Due to this 
prescaling architecture, the maximum operating frequency was 
limited by the incrementer, DFF access time, and the AND 
gate delay. The AND gate delay could potentially be large for 
large sized counters due to large fan-in and fan-out parasitic 
components. Design modifications reduced the AND gate delay,
and subsequently improved operating frequency, by redistributing the
AND gates to a smaller fan-in and fan-out layout separated by 
latches. However, the drawback of this redistribution was 
increased count latency (number of clock cycles required 
before the output of the first count value). In addition, due to 
the design structure, this counter architecture inherited an 
irregular VLSI layout structure and resulted in a large area 
overhead. Hoppe et al. [8] improved counter operating 
frequency by incorporating a 2-bit Johnson counter [12] into 
the initial counting module (least significant) in a partitioned 
counter architecture. However, the increase in operating 
frequency was offset by reduced counting capability. 

In Hoppe's design, counting modules of higher
significance were constructed of standard synchronous 
counters triggered by the Johnson counter and additional 
synchronization logic. However, the synchronization circuit 
and initial module still limited the operating frequency and 
resulted in reduced applicability. 

Kakarountas et al. [11] used a carry look-ahead circuit [6] 
to replace the carry chain. The carry look-ahead circuit used a 
prescaler technique with systolic 4-bit counter modules [which
used T-type flip-flops (TFFs)], at the cost of an extra
detector circuit. The detector circuit detected the assertion of
lower order bits to enable counting in the higher order bits. To
further improve operating frequency, Kakarountas's design
used DFFs between systolic counter modules. The clock
period was bounded by the delay of two input gates in addition
to the TFF access and setup-hold time. Large counter widths
incurred an additional three-input logic gate delay. However,
since the counter design was limited by control signal
broadcasting, Kakarountas's design was not practical for large
counter widths even though the Xilinx Data Book [24] shows
that several counter designs with the highest operating
frequencies use prescaler techniques. In order to create a more
efficient architecture for large counter widths and more
amenability to a wider application range, counter architectures,
such as up/down counters [18], [20], added extra (redundant)
registers (while still using partitioned counter modules [4],
[22]) to store the previous counter state during a counter state
transition (counter increment). Thus, when the counting
direction changed (from up to down or down to up), the
contents of these count state registers determined the next
counter state. 

Jones et al. [10] designed a counter specialized for 
applications with fast arithmetic operations [7], [17] using a 
half/full adder prefix structure. This prefix structure partially 
alleviated the cascading adder carry chain delay at the expense 
of a large area overhead. However, prefix structures are not 
practical for large counter widths due to an increase in the 
number of inputs, resulting in a large number of wide adders 
with large delays. Several modern counter designs are well 
suited to applications with various arithmetic operations, such 
as systolic counters and population counters. Systolic counters 
[16], [19] have high operating frequencies at the expense of 
representing the count value using two redundant binary 
numbers, which results in a large area overhead for state 
decoding. Population counters [13], [14], [23] and counting 
responders [5] provide high operating frequencies using the 
relationship between counter inputs and outputs based on 
listing all input bits (input vector length). 

Literature reports population counters as capacitive 
threshold-logic gates [13], cascading trees of full/half adders
[23], or a shift switch logic structure using an output decoding 
methodology [14]. Other modern counter designs target 
particular applications (such as combinatorial optimizations 
and image processing) using the "choose" counter (-counter) 
[9]. However, a logarithmic shift operation delay limits this 
counter design's applicability to only small and values. (A 
thorough literature review of large parallel counter designs can 
be found in [21].) Finally, alternative counter designs increased
counter operating frequency using ratioed logic dynamic DFFs
[3], [25], [26]; however, these designs tended to have large
area overheads, making them not ideal for continued CMOS
technology scaling. In order to reduce high counter power 
consumption, Alioto et al. [2] presented a low power counter 
design with a relatively high operating frequency. 

Alioto's design was based on cascading an analog block
(these analog blocks were structured using MOS current mode 
logic to represent an analog divider stage) such that each 
counting stage's (module's) input frequency was halved 
compared to the previous counting stage (module). However, 
Alioto's counter design's carry chain rippled through all
counting stages, resulting in a total critical path delay equal to
the sum of all counting stage delays. Consequently, Alioto's
design was not well suited for large counter widths because 
the carry chain limited operating frequency even though the 
carry chain voltage was not rail-to-rail. In addition, the counter 
circuit's continuous standby current required a device 
shutdown mechanism in order to regulate power consumption. 
Furthermore, the counter circuit's active margin was bounded 
by 1/3 of the supply voltage, which resulted in high design 
costs with current CMOS technologies that usually inherit low 
supply voltages. 

In this paper, the counter operating frequency is improved 
using a novel parallel counting architecture in conjunction 
with a state look-ahead path and pipelining to eliminate the 
carry chain delay and reduce AND gate fan-in and fan-out. 
The state look-ahead path bridges the anticipated overflow 
states to the counting modules, which are exploited in the 
counting path.

The counting modules are partitioned into smaller 2-bit
counting modules separated by pipelined DFF latches. The
state look-ahead path is partitioned using the same pipelined
alignment paradigm as the counting path and thereby provides
the correct anticipated overflow states for all counting stages.
Subsequently, all counting states and all pipelined DFFs (in
both paths) are triggered concurrently on the clock edge,
enabling the count state in modules of higher significance to
be anticipated by the count state in modules of lower
significance.

Fig. 1 Functional block diagram of 8-bit parallel counter. The
state look-ahead logic consists of all logic encompassed by the
dashed box and the counting logic consists of all other logic
(not encompassed by the dashed box).

III. PARALLEL COUNTER ARCHITECTURE 

Fig. 1 shows the block diagram of the 8-bit counter. The
main structure consists of the state look-ahead path (all logic 
encompassed by the dashed box) and the counting path. The 
counter is constructed as a single mode counter, which 
sequences through a fixed set of preassigned count states, of 
which each next count state represents the next counter value 
in sequence. The counter is partitioned into uniform 2-bit 
synchronous up counting modules. Next state transitions in 
counting modules of higher significance are enabled on the 
clock cycle preceding the state transition using stimulus from 
the state look-ahead path. Therefore, all counting modules 
concurrently transition to their next states at the rising clock 
edge (CLKIN). 

ARCHITECTURAL FUNCTIONALITY 

The counting path's counting logic controls counting
operations, and the state look-ahead path's state look-ahead
logic anticipates future states and thus prepares the counting
path for these future states. Fig. 1 shows the three module
types (module-1, module-2, and module-3S, where S = 1, 2, 3,
etc.) used to construct both paths. Module-1 and module-3S are
exclusive to the counting path and each module represents two
counter bits. Module-2 is a conventional positive-edge-triggered
DFF and is present in both paths. In the counting
path, each module-3S is preceded by an associated module-2.
Module-3Ss serve two main purposes. Their first purpose is
to generate all counter bits associated with their ordered
position and their second purpose is to enable future states in
subsequent module-3Ss (higher S values) in conjunction with
stimulus from the state look-ahead path.

COUNTING PATH

Module-1 is a standard parallel synchronous binary 2-bit
counter, which is responsible for low-order bit counting and
generating future states for all module-3Ss in the counting
path by pipelining the enable for these future states through
the state look-ahead path. Figs. 2 and 3 depict the hardware
schematic and state diagram for module-1.
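
As a rough illustration of module-1's behavior, the following Python sketch models a 2-bit synchronous up counter with an early-enable output. The class name `Module1`, the method names, and in particular the choice to assert the enable in state 10 (one state before overflow) are assumptions made for illustration; the text does not define the enable equation.

```python
from dataclasses import dataclass


@dataclass
class Module1:
    """Behavioral sketch of a 2-bit synchronous up counter (module-1)."""

    q: int = 0  # current count Q1Q0, always in the range 0..3

    def qen1(self) -> bool:
        # ASSUMPTION: the enable is decoded one state before overflow
        # (Q1Q0 = 10), so a DFF that registers it is high while Q1Q0 = 11.
        return self.q == 0b10

    def clock(self) -> None:
        # Rising clock edge: advance one state, wrapping from 11 to 00.
        self.q = (self.q + 1) & 0b11


def trace(cycles: int) -> list[tuple[int, bool]]:
    """Record (count, enable) before each of `cycles` clock edges."""
    m = Module1()
    out = []
    for _ in range(cycles):
        out.append((m.q, m.qen1()))
        m.clock()
    return out
```

Calling `trace(5)` lists the states 0, 1, 2, 3, 0, with the enable asserted only in state 2.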



Fig. 2 Module-1 hardware schematic
Fig. 3 Module-1 state diagram

The placement of module-2s in the counting path is critical
to the novelty of our counter structure. Module-2s in the
counting path act as a pipeline between module-1 and
the first module-3S (S = 1) and between subsequent module-3Ss (see Fig. 1).
Module-2 placement (coupled with state look-ahead logic)
increases counter operating frequency by eliminating the
lengthy AND-gate rippling and large AND gate fan-in and
fan-out typically present in large-width parallel counters. Thus,
instead of the modules of higher significance requiring the
AND of all enable signals from modules of lower significance,
modules of higher significance (module-3Ss in our design) are
simply enabled by the module-3S's preceding module-2 and
state look-ahead logic. Thus, the module-2s in the counting
path provide a one-cycle look-ahead mechanism for triggering
the module-3Ss, enabling the module-2s to maintain a
constant delay for all stages and all module-3Ss to count in
parallel at the rising clock edge instead of waiting for the
overflow rippling in a standard ripple counter.
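
The replacement of a wide AND by a registered one-cycle look-ahead can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it models the decoded conditions behaviorally rather than the gate-level circuit of Fig. 1. It compares the conventional enable (all lower bits equal to 1 in the current state) with an enable decoded one state earlier and held in a DFF.

```python
def wide_and_enable(value: int, k: int) -> bool:
    """Conventional enable for the k-th 2-bit module: AND of all 2*k lower bits."""
    mask = (1 << (2 * k)) - 1
    return (value & mask) == mask


def lookahead_enable(prev_value: int, k: int) -> bool:
    """Enable decoded from the previous state and held for one cycle in a DFF.

    The lower bits are compared with 'all ones minus one', i.e. the state
    that immediately precedes the all-ones state.
    """
    mask = (1 << (2 * k)) - 1
    return (prev_value & mask) == mask - 1


def check(width: int = 8) -> bool:
    """Return True if both enables agree for every count value and module."""
    modulus = 1 << width
    for value in range(modulus):
        prev_value = (value - 1) % modulus  # state one clock cycle earlier
        for k in range(1, width // 2):
            if wide_and_enable(value, k) != lookahead_enable(prev_value, k):
                return False
    return True
```

The two enables agree because the counter always advances by one: the lower bits are all ones in the current state exactly when they were one short of all ones in the previous state.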

STATE LOOK-AHEAD PATH

The state look-ahead path operates similarly to a carry
look-ahead adder in that it decodes the low-order count states
and carries this decoding over several clock cycles in order to
trigger high-order count states. The state look-ahead logic is
equivalent in principle to the one-cycle look-ahead mechanism
in the counting path.

Module-2s in the state look-ahead logic are responsible for
propagating (pipelining) the early overflow detection to the
appropriate module-3S. Early overflow is initiated by
module-1 through the left-most column of decoders (state-2,
state-3, etc.).




Fig. 4 Module-3S hardware schematic



Figs. 4 and 5 depict the hardware schematic and state
diagram for module-3S. Module-3S is a parallel synchronous
binary 2-bit counter whose count is enabled by INS. INS
connects to the Q output of the preceding module-2. Module-3S
outputs Q1Q0 (which connect to the appropriate count
output bits QX and Q(X-1) as shown in Fig. 1) and QEN3 = Q1
AND Q0 AND QC (the 3 in QEN3 denotes that this is the
QEN for module-3S). The state look-ahead logic provides the
QC input. QEN3 connects to the subsequent module-2's DIN
input and provides the one-cycle look-ahead mechanism.
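
The module-3S behavior described above can be sketched in a few lines of Python. The class and method names are hypothetical; only the enable input INS, the outputs Q1Q0, and the equation QEN3 = Q1 AND Q0 AND QC are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Module3S:
    """Behavioral sketch of module-3S: a 2-bit up counter enabled by INS."""

    q: int = 0  # current count Q1Q0, always in the range 0..3

    def qen3(self, qc: bool) -> bool:
        # QEN3 = Q1 AND Q0 AND QC, with QC supplied by the look-ahead logic.
        q1 = bool(self.q & 0b10)
        q0 = bool(self.q & 0b01)
        return q1 and q0 and qc

    def clock(self, ins: bool) -> None:
        # Rising clock edge: count only when INS (the Q output of the
        # preceding module-2) is asserted; otherwise hold the state.
        if ins:
            self.q = (self.q + 1) & 0b11
```

In this sketch the module holds its state when INS is low, and QEN3 can only be asserted in state 11 with QC high.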

Fig. 5 Module-3S state diagram

IV. TIMING DIAGRAM 

Fig. 6 depicts the timing diagram for the sample 8-bit
counter in Fig. 1, showing all related events that occur for
a start count state of 101000.
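
To make the sequence of events concrete, the sketch below simulates an 8-bit counter cycle by cycle. It is a behavioral model under stated assumptions, not the circuit of Fig. 1: the function name `simulate`, the enable decode in state 10 of module-1, the single look-ahead DFF that captures "Q3Q2 = 11 and Q1Q0 = 01", and the preset of the DFFs for an arbitrary start value are all choices made here so that every enable uses at most three inputs.

```python
def simulate(start: int, cycles: int) -> list[int]:
    """Return the count values seen after each of `cycles` rising clock edges."""
    # Split the 8-bit start value into four 2-bit states; q[0] is module-1.
    q = [(start >> (2 * i)) & 0b11 for i in range(4)]
    # Counting-path DFFs (module-2): en[i] enables q[i]; index 0 is unused.
    # ASSUMPTION: preset to the values they would hold had the counter
    # counted up to `start` normally.
    en = [False,
          q[0] == 3,
          q[0] == 3 and q[1] == 3,
          q[0] == 3 and q[1] == 3 and q[2] == 3]
    # Look-ahead DFF (module-2 in the state look-ahead path), same preset rule.
    la = q[1] == 3 and q[0] == 2
    seen = []
    for _ in range(cycles):
        # Combinational decode from the currently registered state.
        qen1 = q[0] == 2                   # module-1 early enable (state 10)
        qen3_1 = q[1] == 3 and qen1        # Q1 AND Q0 AND QC, QC = state-2 decode
        qen3_2 = q[2] == 3 and la          # Q1 AND Q0 AND QC, QC = look-ahead DFF
        la_next = q[1] == 3 and q[0] == 1  # decoded one cycle earlier, then pipelined
        # Rising clock edge: every module and DFF updates at the same time.
        q = [(q[0] + 1) & 3] + [(q[i] + en[i]) & 3 for i in range(1, 4)]
        en = [False, qen1, qen3_1, qen3_2]
        la = la_next
        seen.append(sum(q[i] << (2 * i) for i in range(4)))
    return seen
```

Under these assumptions, `simulate(0b101000, 7)` steps through 101001 to 101111, which corresponds to the seven count operations following the start state 101000.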

Fig. 6 Timing diagram for the 8-bit counter in Fig. 1 starting
with an initial count value of 101000 and operating for seven
subsequent count operations. Clock cycle counts are denoted
along the top of the timing diagram.

Table 1. Symbol notation definitions for the timing diagram in
Fig. 6

Fig. 7. Simulation waveforms for a synthesized HDL
representation of the 8-bit parallel counter

VI. CONCLUSION 

In this paper, a scalable high-speed parallel counter using
digital CMOS gate logic components is proposed. The counter
design logic comprises only 2-bit counting modules and
three-input AND gates. The counter structure's main features
are a pipelined paradigm and state look-ahead path logic
whose interoperation activates all modules concurrently at the
system's clock edge, thus providing all counter state values at
the exact same time without rippling effects. In addition, this
structure avoids using a long chain detector circuit typically
required for large counter widths. An initial m-bit counting
module pre-scales the counter size and this initial module is
responsible for generating all early overflow states for
modules of higher significance. In addition, this structure uses
a regular VLSI topology, which is attractive for continued
technology scaling due to two repeated module types
(module-2s and module-3Ss) forming a pattern paradigm and no increase
in fan-in or fan-out as the counter width increases, resulting in
a uniform frequency delay that is attractive for parallel
designs.

Consequently, the counter frequency is greatly improved by 
reducing the gate count on all timing paths to two gates using 
advanced circuit design techniques. However, extra 
precautions must be considered during synthesis or layout 
implementations in order to align all modules in vertical
columns with the system clock. This layout avoids setup and 
hold time violations, which might ultimately be limited by 
race conditions. Finally, the counter output is determined 
directly on-the-fly with no additional decoding latency 
necessary to decode the final output pattern as with most 
counter designs. 